1  SUMMARY


This project centered on the following challenges. Limited reusability and abstraction of
certified code: The fact that any change to a system must trigger re-certification is untenable. The
project considered mechanisms to raise the level of abstraction in certification so that it becomes
possible to reason about correctness at a higher level, using rigorously defined, unambiguous
semantics for all intermediate representations in the compilation process, rather than at the level
of the generated assembly. Scaling validation and composability: The size of modern code bases
suggests that different levels of modularity in verification are required. We considered architectures
and principles that facilitate modular verification at different levels of granularity. Concurrency:
Even though concurrency is a pervasive part of modern software and hardware systems, it has often
been ignored in safety-critical system designs. A major focus of this effort was how best to reason
about concurrency as an intrinsic feature by assuming that all activities execute on multicore
hardware with potentially relaxed memory, relying on precise memory model specifications at both
the language and architecture level to reason about possible behaviors.

New verification approaches and methodologies lie at the heart of our answer to these challenges.
In particular, ensuring the correctness of the translation from source to target effected by a
compiler is a critical prerequisite to building an automatically certified software stack. The
existence of such an artifact would dramatically change the safety-critical application landscape,
relieving the need for costly manual inspection of source and binary and enabling a richer class of
optimizations, leading to more efficient and scalable applications. Specifically, we addressed the
challenges enumerated above in the following ways. Reusability and abstraction are achieved through
the use of a high-level type-safe language like Java, rather than a low-level one like C, enabling
us to reason about correctness in terms of precise source-language invariants. Scaling and
composability were achieved by defining new modular proof techniques to aid the compiler writer in
proving the correctness of optimizations, even in the presence of sophisticated managed
(concurrent) runtime services like garbage collectors. Important issues related to concurrency were
addressed by refining the existing Java memory model to make it more amenable to incorporation
within a verified compiler.


2  INTRODUCTION 

Three major activities undertaken during the lifetime of this effort built on these ideas. The
first was to explore the construction and compilation methodology of an intermediate
representation capable of facilitating correctness proofs on the behavior of concurrently executing
runtime services. The second was the development of a precise, operationally defined memory model
that relates the definition of the Java Memory Model (JMM) to weak microprocessor architectures
like IBM's Power. The third was the specification and verification of a state-of-the-art concurrent
garbage collector as a substantial demonstration of the efficacy of our ideas. All three activities
share the overarching goal of developing strong (mechanically checkable) safety guarantees for
high-level language implementations built on top of sophisticated runtime services and
architectural platforms.

2.1  Proof  Methodology


Our first significant result was the development of a new intermediate representation and
associated proof methodology to facilitate certified compilation of high-level managed languages
like Java or C#. Managed languages provide intrinsic support for concurrency at several levels.
Applications can express concurrent computations using threads and synchronization primitives.
Additionally, to improve scalability or performance, elements of the language implementation itself
may run concurrently with application threads. The interactions between application threads and
the diverse components of the language runtime system are regulated by compiler-injected code
snippets. Typical examples of injected code include allocation fast paths, read and write barriers,
synchronization fences, and initialization checks. These concurrent snippets are sophisticated,
often racy, and must operate correctly in an environment subject to program transformations, both
local and global. The subtleties involved in dealing with these low-level code fragments within
the context of already complex source and target languages justify the effort of adopting a
verified compilation strategy. However, verifying the correctness of a compiler for these kinds of
languages is a challenging and ambitious goal, as it entails reasoning about the inherently
parallel behavior of concurrent operations in the source language, as well as the possibly racy,
non-atomic operations introduced by the compiler. Low-level implementations provide a performant
variant of high-level specifications that are exploited by the compiler. Reconciling the dichotomy
between these two abstraction layers is key to any feasible verification strategy. To do so, we
developed new refinement predicates that relate the "high" and "low" definitions of concurrent
code. Informally, we say that a low-level statement l refines a high-level one h if the execution
of both l and h starting from the same state leads to the same final state; furthermore, if
executing l admits a trace tr of interleaved actions of other threads, then tr must be admissible
as a feasible trace under the execution of h. This notion of refinement guarantees the equivalence
of high- and low-level code. Given a high-level specification h that captures the atomicity
properties implicit in l, the refinement predicate helps the compiler writer devise a proof that l
refines h.

While recent years have seen progress in compiler verification, much of this work has been for
sequential languages like C. The basic correctness argument requires proving that any behavior
admitted by the compiled program is also admitted by the source. This is typically shown by a
backward simulation proof between target and source language semantics. Assuming the source
program is safe, a backward simulation demonstrates that any observable behavior produced by the
target program is a valid observable behavior of the source program as defined by the source
language semantics. Demonstrating such a simulation is complicated by the presence of concurrency.
Managed languages add further complications because they often compile a single source memory
access to multiple low-level memory accesses, as a result of code injected by the compiler. For
example, Java compilers typically inject write barriers before each field update to support garbage
collection. Indeed, implementations of performant write barriers typically use a non-trivial
protocol to communicate with the garbage collector thread, notifying it that changes are being made
to the object graph. Dealing with concurrency is thus quite challenging since it requires proving
concurrent invariants of the underlying implementation of the compiler and runtime system, internal
data structures, and communication protocols. The details of these protocols are not visible to
the high-level source. Consequently, a naive approach to verification of injected concurrent code
fragments is not scalable using a standard backward simulation argument.
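
To make the "one source access, several low-level accesses" point concrete, here is a minimal Python sketch of a write barrier. It is not the barrier used in this project: the `Runtime` class, the `marking` flag, the `log` handed to the collector, and the choice to record the overwritten reference are all hypothetical, chosen only to show how a single field update expands into a small protocol.

```python
from dataclasses import dataclass, field

@dataclass
class Obj:
    refs: dict = field(default_factory=dict)      # field name -> referenced object

@dataclass
class Runtime:
    """Hypothetical state shared between mutators and a collector thread."""
    marking: bool = False                          # set by the collector while it traces
    log: list = field(default_factory=list)        # references handed to the collector
    accesses: list = field(default_factory=list)   # low-level memory accesses, in order

def write_field(rt, obj, name, new_ref):
    """Expansion of the single source-level update `obj.name = new_ref`."""
    rt.accesses.append("load marking")
    if rt.marking:
        rt.accesses.append(f"load {name}")
        old_ref = obj.refs.get(name)
        if old_ref is not None:
            rt.accesses.append("store log")
            rt.log.append(old_ref)                 # notify the collector of the overwritten edge
    rt.accesses.append(f"store {name}")
    obj.refs[name] = new_ref

# One source-level update performed while the (hypothetical) collector is marking.
rt = Runtime(marking=True)
a, b, c = Obj(), Obj(), Obj()
a.refs["next"] = b
write_field(rt, a, "next", c)
print(rt.accesses)
```

In this toy setting the update touches memory four times, and every one of those accesses can interleave with the collector. That is the kind of fragment a backward simulation would otherwise have to reason about step by step.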

To address this challenge in verified compilation of managed languages, we developed an atomicity
refinement methodology that coarsens the granularity of injected pieces of code, thereby
simplifying the overall verification of the compiler infrastructure. Our approach facilitates the
modular expression of such proofs, making a backward simulation argument feasible by establishing
the equivalence of fine-grained and coarse-grained representations of concurrency operations, in
isolation from the other components in the program. The refinement enables a simulation argument
similar to the ones used to demonstrate the correctness of sequential optimizations, and hence
allows such arguments to be effectively applied to potentially racy, lock-free, concurrent code.
This particular approach is motivated by the premise that establishing that the high-level
specification captures the behavior defined by the source program is substantially easier than
directly proving the correspondence between low-level target and source.

2.2  Memory  Models 

Our second major result concerns the verification of memory models within the compiler toolchain.
The Java Memory Model is intended to characterize the meaning of concurrent Java programs.
Because of the model's complexity, however, its definition cannot be easily transplanted into
an optimizing Java compiler, even though an important rationale for its design was to ensure that
Java compiler optimizations are not unduly hampered by the language's concurrency features.
In response, the JSR-133 Cookbook for Compiler Writers, an informal guide to realizing the
principles underlying the JMM on different (relaxed-memory) platforms, was developed. The goal
of the cookbook is to give compiler writers a relatively simple, yet reasonably efficient, set of
reordering-based recipes that satisfy JMM constraints.
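
As an illustration of what a reordering-based recipe looks like, the sketch below encodes the cookbook's table of barriers required between two consecutive operations, together with its x86 mapping, in which only StoreLoad needs an instruction. The Python names (`REQUIRED`, `with_barriers`, `lower_x86`) are hypothetical, and the placement strategy (looking only at adjacent pairs in a straight-line sequence) is a simplification of the more conservative placement the cookbook describes. The Power mapping is deliberately left out, and the tables should be checked against the cookbook itself.

```python
N_LOAD, N_STORE, V_LOAD, V_STORE = "load", "store", "vload", "vstore"

# Barrier required between a first and a second operation (absent = none required).
# Volatile load also stands for MonitorEnter, volatile store for MonitorExit.
REQUIRED = {
    (N_LOAD, V_STORE): "LoadStore",
    (N_STORE, V_STORE): "StoreStore",
    (V_LOAD, N_LOAD): "LoadLoad",
    (V_LOAD, N_STORE): "LoadStore",
    (V_LOAD, V_LOAD): "LoadLoad",
    (V_LOAD, V_STORE): "LoadStore",
    (V_STORE, V_LOAD): "StoreLoad",
    (V_STORE, V_STORE): "StoreStore",
}

# x86 lowering: TSO already preserves every ordering except store -> load.
X86 = {"LoadLoad": None, "LoadStore": None, "StoreStore": None, "StoreLoad": "mfence"}

def with_barriers(ops):
    """Interleave abstract barriers between adjacent operations."""
    out = []
    for first, second in zip(ops, ops[1:]):
        out.append(first)
        barrier = REQUIRED.get((first, second))
        if barrier is not None:
            out.append(barrier)
    out.extend(ops[-1:])
    return out

def lower_x86(seq):
    """Replace abstract barriers by x86 instructions, dropping the no-ops."""
    out = []
    for item in seq:
        if item in X86:
            if X86[item] is not None:
                out.append(X86[item])
        else:
            out.append(item)
    return out

print(with_barriers([V_LOAD, N_STORE, V_STORE]))
print(lower_x86(with_barriers([V_STORE, V_LOAD])))
```

Keeping the abstract barriers as an architecture-independent layer and lowering them per target mirrors the structure of the recipes. A formalization has to justify both layers.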

To aid our certification effort, we embarked on a formalization of the cookbook, providing a
semantic basis upon which the relationship between the recipes defined by the cookbook and the
guarantees enforced by the JMM can be rigorously established. Notably, one finding of our
investigation is that the rules defined by the cookbook for compiling Java onto the Power
microprocessor are inconsistent with the requirements of the JMM. This surprising result justifies
our belief in the need for formally provable definitions to reason about sophisticated (and racy)
concurrency patterns in Java and their implementation on modern-day relaxed-memory hardware.

Our formalization enables simulation arguments between an architecture-independent intermediate
representation of the kind suggested by the cookbook and machine abstractions for Power and x86.
Our results not only provided fixes for cookbook recipes that are inconsistent with the behaviors
admitted by the target platform, but also proved the correctness of these repairs, and enabled us to
use these verified recipes within our compiler toolchain.

2.3  Runtime  System  Verification 

Concurrent garbage collection algorithms are an emblematic challenge in the area of concurrent
program verification. We tackled this problem by proposing a mechanized proof methodology based on
the popular Rely-Guarantee (RG) proof technique. We designed a specific compiler intermediate
representation (IR) with strong type guarantees, dedicated support for abstract concurrent data
structures, parametric specifications of memory model behavior, and high-level iterators on runtime
internals. In addition, we defined an RG program logic supporting an incremental proof methodology
where annotations and invariants can be progressively enriched.

We have formalized the IR and the proof system, and have proven the soundness of the methodology
in the Coq proof assistant. Equipped with this IR, we were able to prove the correctness of a fully
concurrent garbage collector in which mutators never have to wait for the collector.


3  METHODS,  ASSUMPTIONS,  AND  PROCEDURES 

This effort is concerned with the verified compilation of high-level managed languages like Java
or C# whose intermediate representations provide support for shared-memory synchronization and
automatic memory management. In this environment, the interactions between application threads
and the language runtime (e.g., schedulers, memory managers, etc.) are regulated by
compiler-injected code snippets. Examples of snippets include allocation fast paths, read and write
barriers, synchronization fences, and data initialization checks. For performance, the code
injected by the compiler is often sophisticated and racy, but must nonetheless operate correctly in
the presence of program transformations, both local and global. This entails reasoning about the
inherently parallel behavior of operations in the source language, as well as the operations
introduced by the compiler. A naive approach would entail examination of all possible thread
interleavings, an impractical and non-scalable exercise.

To tackle this problem, we developed a general and flexible atomicity refinement technique that
coarsens the granularity of injected snippets of code, hence facilitating simulation proofs between
bytecode computations and their racy, fine-grained implementations. We illustrate our approach
by considering the following low-level code snippet that attempts to acquire a lock (akin to a
high-level monitorenter bytecode instruction of Java):

repeat {                              atomic {
  old := cas(Lock, 0, 1);               assume (Lock == 0);
  current := old;                       Lock := 1;
  while (current != 0) do             }
    current := Lock;
} until (old == 0)

In the implementation on the left, lock acquisition potentially requires multiple iterations of a
loop that attempts to change the global variable Lock from 0 to 1 through a cas (compare-and-set)
instruction. On the other hand, the code on the right is atomic, and only proceeds if the Lock
variable is 0 (the semantics of assume guarantees that). It is obviously easier to match the
semantics of monitorenter with the code on the right. Our refinement technique can establish that
the atomic piece of code on the right simulates the low-level implementation on the left,
simplifying the verification burden. Moreover, we designed our framework to be cognizant of weak or
relaxed memory model behavior, such as that defined by the Total Store Ordering (TSO) relaxed
memory model of Intel x86 processors [63], thus rendering it applicable in realistic environments.

Language X:

  s ∈ X ::= skip | d = op(r⃗) | s; s | if c then s else s | repeat s until c
          | load_v(d, r) | store_v(d, r) | cas(d, r, o, n) | fence | abort
          | atomic s | assume c | branch s, s | loop s
  v     ::= Local | ε

Figure 1: Syntax of X.


We  illustrate  our  technique  via  three  simple,  yet  representative,  rules: 


DeadCode
  s1 is dead
  -------------
  s1; s2 ≼ s2

GrowAtomicLocal
  d is known to be local in the context
  ---------------------------------------------
  d := Z; atomic { s } ≼ atomic { d := Z; s }

CAS-success
  d := cas(r, v0, v1); assume(v0 = d) ≼ atomic { d := load r; assume(v0 = d); r := v1 }


Rule DeadCode states that code which affects variables that are not used later can simply be
discarded. Rule GrowAtomicLocal states that if a certain variable is known to be local in a
certain context, then accesses to this variable can be considered as happening atomically with the
code that follows. Finally, CAS-success establishes that a successful cas operation can be
treated as an atomic operation. These rules form the core of the proof that the spin-lock example
presented before is soundly abstracted as an atomic block.
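
For readers who want to execute the spin-lock example, the following is a direct Python transcription of the low-level acquire loop. It is only a sketch: the `Cell` class and its `cas` method are hypothetical, and a `threading.Lock` stands in for the atomicity that the hardware gives the cas instruction. The counter experiment merely exercises the loop; it is not a substitute for the refinement proof.

```python
import threading

class Cell:
    """Shared word with compare-and-set; an internal lock emulates hardware atomicity."""
    def __init__(self, value=0):
        self.value = value
        self._hw = threading.Lock()

    def cas(self, expected, new):
        with self._hw:
            old = self.value
            if old == expected:
                self.value = new
            return old                      # like the snippet's cas, return the value read

def acquire(lock):
    while True:                             # repeat { ... } until (old == 0)
        old = lock.cas(0, 1)
        current = old
        while current != 0:                 # spin on plain reads until the lock looks free
            current = lock.value
        if old == 0:
            return

def release(lock):
    lock.value = 0

def run(threads=2, iters=200):
    """Increment a shared counter under the spin lock and return its final value."""
    lock, counter = Cell(), [0]

    def worker():
        for _ in range(iters):
            acquire(lock)
            counter[0] += 1                 # critical section: non-atomic read-modify-write
            release(lock)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter[0]

print(run())
```

From the caller's point of view, `acquire` behaves like the atomic block that waits for `Lock == 0` and then sets it to 1. The failed iterations leave no observable effect, which is the intuition behind discarding them as dead code.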


We have implemented a certified compiler for Java that implements and proves the soundness of
our atomicity refinement technique as an extension of the CompCertTSO verified compiler [76].
Because our technique allows the compiler writer to reason compositionally about the atomicity of
low-level concurrent code used to implement managed services, it facilitates verified compilation
of non-trivial concurrent runtime components. To demonstrate the applicability of our approach,
we have also written a concurrent garbage collector based on the algorithm presented in [23]. A
particular characteristic of this garbage collector is that it exploits knowledge about TSO (weak
memory) behavior by not adding unnecessary fences whose inclusion would otherwise incur
substantial performance penalties. We have proven the atomicity of the pieces of code that are
necessary to implement this garbage collector (including write barriers and allocation). In the
absence of our TSO-aware refinement methodology, significantly more fences would be necessary to
make our correctness proof tractable, resulting in diminished collector performance. Our initial
investigation of how garbage collectors interact with client code was further refined and
significantly enhanced in later phases of the project, where we substantially augmented the
definitions and capabilities of the intermediate representations used by the compiler to facilitate
more sophisticated reasoning about garbage collection behavior.

Because our development was initially framed in the context of the Total Store Order relaxed
memory model, ensuring compiler correctness became challenging: high-level actions are
translated into sequences of non-atomic actions with compiler-injected snippets of racy code, and
the behavior of this code depends not only on the actions of other threads, but also on
out-of-order reorderings performed by the processor. A naive proof of correctness would require
reasoning over all possible thread interleavings. Instead, we developed a refinement-based proof
methodology that precisely relates concurrent code expressed at different abstraction levels,
cognizant throughout of the relaxed memory semantics of the underlying processor. Our technique
allows the compiler writer to reason compositionally about the atomicity of low-level concurrent
code used to implement managed services.
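
The following Python sketch shows what the naive approach amounts to on a toy TSO machine: each thread has a FIFO store buffer, loads read the newest matching entry of the thread's own buffer before falling back to memory, and a fence can only execute once the buffer is empty. The instruction encoding and the `explore` function are hypothetical and far simpler than the x86-TSO model used in the project. The program explored is the classic store-buffering test.

```python
def freeze(d):
    return tuple(sorted(d.items()))

def explore(programs):
    """Enumerate every interleaving; return the set of final register valuations."""
    results, seen = set(), set()
    n = len(programs)
    stack = [((0,) * n, ((),) * n, (), ())]      # (pcs, store buffers, memory, registers)
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, bufs, mem, regs = state
        if all(pc == len(p) for pc, p in zip(pcs, programs)) and not any(bufs):
            results.add(regs)
            continue
        for t, prog in enumerate(programs):
            if bufs[t]:                           # flush the oldest buffered store to memory
                (var, val), rest = bufs[t][0], bufs[t][1:]
                m = dict(mem)
                m[var] = val
                stack.append((pcs, bufs[:t] + (rest,) + bufs[t + 1:], freeze(m), regs))
            if pcs[t] == len(prog):
                continue
            ins = prog[pcs[t]]
            npcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
            if ins[0] == "store":                 # ("store", var, value): goes to the buffer
                nb = bufs[t] + ((ins[1], ins[2]),)
                stack.append((npcs, bufs[:t] + (nb,) + bufs[t + 1:], mem, regs))
            elif ins[0] == "load":                # ("load", var, reg): own buffer, then memory
                val = next((v for x, v in reversed(bufs[t]) if x == ins[1]),
                           dict(mem).get(ins[1], 0))
                r = dict(regs)
                r[ins[2]] = val
                stack.append((npcs, bufs, mem, freeze(r)))
            elif ins[0] == "fence" and not bufs[t]:
                stack.append((npcs, bufs, mem, regs))
    return results

sb = [[("store", "x", 1), ("load", "y", "r0")],
      [("store", "y", 1), ("load", "x", "r1")]]
sb_fenced = [[("store", "x", 1), ("fence",), ("load", "y", "r0")],
             [("store", "y", 1), ("fence",), ("load", "x", "r1")]]

def pairs(outcomes):
    """Project outcomes to (r0, r1) pairs."""
    return {tuple(v for _, v in o) for o in outcomes}

print(sorted(pairs(explore(sb))))
print(sorted(pairs(explore(sb_fenced))))
```

Without fences both loads can return 0, an outcome no sequentially consistent interleaving produces. With a fence between each store and load that outcome disappears. Even for two two-instruction threads the explored state space is noticeably larger than the program, which hints at why exhaustive reasoning does not scale.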


While formalizing language behavior in the context of a hardware memory model like TSO is useful
and essential to understanding a realistic certified compilation strategy, it is insufficient in
the context of a language like Java because it fails to capture and express executions defined in
terms of the Java Memory Model's view of allowable reorderings. The JMM is intended to
characterize the meaning of concurrent Java programs. Because of the model's complexity, however,
its definition cannot be easily transplanted into an optimizing Java compiler, even though an
important rationale for its design was to ensure that Java compiler optimizations are not unduly
hampered by the language's concurrency features. In response, the JSR-133 Cookbook for Compiler
Writers [49], an informal guide to realizing the principles underlying the JMM on different
(relaxed-memory) platforms, was developed. The goal of the cookbook is to give compiler writers a
relatively simple, yet reasonably efficient, set of reordering-based recipes that satisfy JMM
constraints.


As part of our overall effort on certifying the correctness of Java compilers and their associated
runtime, we developed the first systematic formalization of the cookbook, providing a semantic
basis upon which the relationship between the recipes defined by the cookbook and the guarantees
enforced by the JMM can be rigorously established. Notably, one finding of our investigation is
that the rules defined by the cookbook for compiling Java onto the Power multicore microprocessor
are inconsistent with the requirements of the JMM. This surprising result justifies our belief in
the need for formally provable definitions to reason about sophisticated (and racy) concurrency
patterns in Java and their implementation on modern-day relaxed-memory hardware.


A consequence of our formalization is the ability to mechanize simulation arguments between an
architecture-independent intermediate representation of the kind suggested by [49] and machine
abstractions for Power and x86. Moreover, our technique enabled a methodology for providing
fixes for cookbook recipes that are inconsistent with the behaviors admitted by the target
platform, and for proving the correctness of these repairs.

[Diagram relating Source IR, RTL2, and RTL by simulation arrows]

Figure 2: Proof strategy

4  RESULTS  AND  DISCUSSION 


4.1  Proof  Methodology 


We have developed a new proof methodology to verify the correctness of compiler translations
from a high-level intermediate representation with support for object allocation, field access,
thread creation and synchronization, and memory management to a low-level structured RTL
representation expressed in an IR called RTL2. The RTL1 IR is patterned after CompCertTSO's [76]
RTL, an IR that expresses unstructured control flow graphs, additionally allowing the expression of
high- and low-level statements; these statements are expressed in a structured language called X.
An important aspect of X is its support for coarse-grained atomic instructions which, while not
directly available in the target architecture, are used only to support our atomicity refinement
proofs. As such, there is a sublanguage of X which contains all the low-level (fine-grained)
statements that are directly supported by the architecture; we denote this language by X_L (read
"Inject Low").


Our new proof methodology is based around an expressive notion of refinement that enables
lightweight compositional reasoning about concurrent and potentially racy code within a verified
compiler framework. We concentrate on the code that is injected by the compiler to support services
such as allocators, collectors, synchronization, etc. Our methodology is integrated within the
CompCertTSO verified compiler stack [76]. The refinement technique supports TSO relaxed memory
semantics to allow the verification of low-level concurrent code in the context of x86
multiprocessors. We have validated our methodology via the verified compilation of injected
concurrent program fragments that interact with a realistic concurrent garbage collector. Figure 2
illustrates our methodology. The shaded portion is enabled via our refinement methodology. RTL2
programs are successively refined to replace low-level statements with high-level ones based on our
refinement rules. One arrow in the figure denotes the basic backward simulation; the other denotes
the backward simulation obtained from refinement.


Figure 1 presents the X language, with X_L restricted to the first two lines of the grammar. X_L
has mostly standard commands, with the exception that all statements operate on registers, here
ranged over by the metavariables d, r, o, n, and r⃗ (representing a sequence of registers). X_L
includes skip, sequencing, standard arithmetic and boolean operators, conditionals, repeat-until
loops, loads from and stores to memory (where the registers are assumed to contain memory
locations), a compare-and-set statement corresponding to the CAS instruction found on x86
processors, a fence command for memory ordering purposes, and an abort command to denote
exceptional behavior. Notice that the commands load_v(d, r) and store_v(d, r) have a visibility
annotation v which can be Local or empty. This annotation, which has no runtime effect, indicates
in the program syntax that no other thread in the system can modify the references being accessed
by the command. More unusual are the "high-level" assume, branch, and loop statements and the
coarse-grained atomic statements, which complete the X language. Atomic statements execute without
allowing actions from other threads, loops execute their body a non-deterministic number of times,
and branches non-deterministically choose the branch they should execute; "incorrect" choices
simply manifest as failed assumptions (expressed through assume) in the resulting execution.
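
The interplay of non-deterministic choice and assume can be illustrated with a small interpreter. The sketch below is a hypothetical, purely sequential model (no other threads, no store buffers, loops bounded by a `fuel` parameter) in which a statement maps a state to the set of final states it can reach, and a failed assumption simply contributes no final state. The tuple encoding of statements is an assumption of this sketch, not the representation used in the Coq development.

```python
def freeze(env):
    return tuple(sorted(env.items()))

def run(s, st, fuel=8):
    """Set of final states reachable by statement s from state st."""
    kind, env = s[0], dict(st)
    if kind == "skip":
        return {st}
    if kind == "assign":                     # ("assign", d, f): d := f(env)
        env[s[1]] = s[2](env)
        return {freeze(env)}
    if kind == "seq":
        return {out for mid in run(s[1], st, fuel) for out in run(s[2], mid, fuel)}
    if kind == "if":
        return run(s[2] if s[1](env) else s[3], st, fuel)
    if kind == "assume":                     # a failed assumption discards the execution
        return {st} if s[1](env) else set()
    if kind == "branch":                     # non-deterministic choice between two statements
        return run(s[1], st, fuel) | run(s[2], st, fuel)
    if kind == "loop":                       # body runs 0..fuel times
        reached, frontier = {st}, {st}
        for _ in range(fuel):
            frontier = {out for cur in frontier for out in run(s[1], cur, fuel)}
            reached |= frontier
        return reached
    if kind == "repeat":                     # ("repeat", body, c): run body until c holds
        done, frontier = set(), {st}
        for _ in range(fuel):
            after = {out for cur in frontier for out in run(s[1], cur, fuel)}
            done |= {o for o in after if s[2](dict(o))}
            frontier = {o for o in after if not s[2](dict(o))}
        return done
    raise ValueError(kind)

pos = lambda e: e["x"] > 0
neg = lambda e: not pos(e)
s1 = ("assign", "y", lambda e: 1)
s2 = ("assign", "y", lambda e: 2)
if_stmt = ("if", pos, s1, s2)
branch_stmt = ("branch", ("seq", ("assume", pos), s1), ("seq", ("assume", neg), s2))

inc = ("assign", "x", lambda e: e["x"] + 1)
done = lambda e: e["x"] >= 3
not_done = lambda e: not done(e)
repeat_stmt = ("repeat", inc, done)
loop_stmt = ("seq", ("loop", ("seq", inc, ("assume", not_done))),
                    ("seq", inc, ("assume", done)))

start = freeze({"x": 0})
print(run(if_stmt, start) == run(branch_stmt, start))
print(run(repeat_stmt, start) == run(loop_stmt, start))
```

In both comparisons the "wrong" non-deterministic choices are filtered out by assume, so the high-level statement reaches exactly the final states of the structured one in this sequential setting.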

We inject terms of X on top of the RTL intermediate representation of the CompCertTSO [75]
verified compiler. Thus, some nodes of the RTL language of CompCertTSO will contain X statements.
RTL1 (read "RTL-Inject") is the language resulting from combining RTL with X. The sublanguage
that results from combining RTL with the X_L sublanguage of X is denoted RTL^.

Given a low-level statement s_l defined as part of the translation, we must construct a high-level
statement that matches a provided specification s_h, defined in terms of atomic, assume, loop,
branch, and sequence commands, and a proof that s_l refines s_h (written s_l ≼ s_h). s_l is a
proper implementation of s_h whenever the visibility annotations of s_l hold. To ease the
construction of such a proof, we provide a set of compositional rules that can be applied
interactively using the Coq proof assistant. These rules avoid the need to modify the semantics of
any intermediate representation. We show an excerpt of selected rules provided in our development
in Figure 3.

The rule Trans establishes the obvious transitivity property of refinement. IfBranch and Repeat
allow control structures to be replaced by a combination of assume, loop, and branch statements.
For example, a repeat statement can be refined into one that executes its body a non-deterministic
number of times, verifying each time that the terminating condition is not satisfied, followed by
a terminating iteration where the condition is satisfied. IfAtomic allows an if whose branches are
atomic to be transformed into an atomic if.
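
Because rules such as Repeat and IfBranch only inspect the shape of a statement, they can be sketched as plain tree rewrites. The Python below is a hypothetical illustration: statements are nested tuples, conditions are opaque strings, and `rewrite_first` is an assumed helper that applies a rule at the first matching subterm. It says nothing about soundness, which in the actual development is established in Coq.

```python
def rule_repeat(s):
    """Repeat: repeat s until c  ≼  loop (s; assume ¬c); s; assume c"""
    if s[0] == "repeat":
        _, body, c = s
        return ("seq", ("loop", ("seq", body, ("assume", ("not", c)))),
                       ("seq", body, ("assume", c)))
    return None

def rule_if_branch(s):
    """IfBranch: if c then s1 else s2  ≼  branch (assume c; s1), (assume ¬c; s2)"""
    if s[0] == "if":
        _, c, s1, s2 = s
        return ("branch", ("seq", ("assume", c), s1),
                          ("seq", ("assume", ("not", c)), s2))
    return None

def rewrite_first(s, rule):
    """Apply rule at the first subterm (pre-order) where it matches; None if it never does."""
    out = rule(s)
    if out is not None:
        return out
    for i, child in enumerate(s):
        if isinstance(child, tuple):
            new = rewrite_first(child, rule)
            if new is not None:
                return s[:i] + (new,) + s[i + 1:]
    return None

# A cas retried until it succeeds, nested inside a larger statement.
cas = ("cas", "d", "r", "o", "n")
prog = ("seq", ("skip",), ("repeat", cas, "o == d"))
refined = rewrite_first(prog, rule_repeat)
print(refined)
```

After the rewrite, the failing iterations (`cas; assume ¬(o == d)`) and the final successful one (`cas; assume o == d`) appear as separate subterms. This separation is what lets different rules be applied to each.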


The CAS-Fail rule establishes a refinement between a failed CAS operation and a load operation
that reads the contents of the location in register r into the destination register d. As in
x86-TSO, the load performed by the CAS must be preceded by a fence command. A CAS fails when the
presumed old value is not the same as the value read. Thus, the sequence of low-level statements
that performs the CAS and then assumes the failing condition is a refinement of a simple load on
the location. In contrast, a successful CAS must atomically store the new value into the location,
assuming the location still contains the presumed old value (CAS-Success). Notice that, unlike
CAS-Fail, the CAS-Success rule does not require a fence. This is because the semantics of atomic
blocks implicitly requires that the TSO write buffers be empty, similar to the fence instruction
(see Section 4.1.1). Rule SwapAssume lifts assumptions above other statements in a sequence. Rule
DeadCode allows us to remove a statement with an unused effect. This is a typical exercise with
racy algorithms: a while or repeat loop spins until the current thread takes its turn on a shared
memory access. By turning such a loop into a mix of loop and assume statements, the last iteration,
where the thread gets its launching window, becomes explicit. The block of previous iterations is
generally a dead block that can be removed, since the actions performed within those iterations
have no observable effect. The rule FenceAtomic is an obvious consequence of the fencing behavior
of atomic, which flushes the store buffer upon completion. FenceElim allows us to remove
unnecessary fences. AfterAbort indicates that no commands are executed after an abort.


The rule MakeStoreAtomic is implied by the fencing behavior of atomic and by observing that
stores are indivisible operations. A similar argument applies to MakeLoadAtomic, but in this case
the fence is required to precede the load, which in TSO prevents the load from overtaking
previously issued writes in the buffer. Perhaps the most interesting rule is GrowAtomicLocal,
which allows local memory operations (i.e., loads and stores) to be moved within an atomic block;
such aggregation is clearly acceptable since the effect of the operation is not observable to the
environment. This is guaranteed by the Local visibility annotation, which implies that the pointer
in the register r cannot be changed by the environment (nor can it be observed in the case of
a store). Similar rules EFLeft and EFRight apply to effect-free operations (i.e., those which only
manipulate registers).

Refl
  s ≼ s

Trans
  s1 ≼ s2    s2 ≼ s3
  --------------------
  s1 ≼ s3

Repeat
  repeat s until c ≼ loop (s; assume ¬c); s; assume c

IfBranch
  if c then s1 else s2 ≼ branch (assume c; s1), (assume ¬c; s2)

IfAtomic
  if c then (atomic s0) else (atomic s1) ≼ atomic (if c then s0 else s1)

CAS-fail
  cas(d, r, o, n); assume o ≠ d ≼ fence; load(d, r)

CAS-success
  cas(d, r, o, n); assume o = d ≼ atomic (load(d, r); assume o = d; store(r, n))

SwapAssume
  defines(s) ∩ uses(c) = ∅
  ----------------------------
  s; assume c ≼ assume c; s

DeadCode
  s1 is dead
  -------------
  s1; s2 ≼ s2

FenceAtomic
  fence ≼ atomic skip

FenceElim
  fence ≼ skip

AfterAbort
  abort; s ≼ abort

MakeStoreAtomic
  store(d, r); fence ≼ atomic store(d, r)

MakeLoadAtomic
  fence; load(r, d) ≼ atomic load(r, d)

GrowAtomicLocal
  s0 ∈ { store_Local(d, r), load_Local(d, r) }
  ----------------------------------------------
  s0; atomic s1 ≼ atomic (s0; s1)

EFLeft
  s1 is effect free
  ----------------------------------
  s1; atomic s2 ≼ atomic (s1; s2)

EFRight
  s2 is effect free
  ----------------------------------
  atomic s1; s2 ≼ atomic (s1; s2)

Figure 3: Compositional rules of the refinement predicate (excerpt).

[Table with columns Arrow, Synchronizes, and Meaning; arrows are labelled ev or tr. Rows, by
meaning: Single-thread Contribution; TSO Memory Machine; Memory and Threads Composition (no
atomics); Full System Composition; Abstract Environment Trace; Single-thread with Abstract
Environment]

Figure 4: Synchronization of the Different Semantics


Note that the rules shown in Figure 3 are purely syntactic. This lets us reduce the burden of
applying them interactively by means of a set of custom Coq tactics that automatically explore a
program tree to find a subterm that matches a given refinement rule. Some rules, such as DeadCode,
require discharging preconditions before they can be applied. We discharge these preconditions using
Coq's reflection capabilities: the predicates are executable, and we let Coq prove them by
computation. Significantly, these rules are sound with respect to the semantics given in Section 4.1.1.
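
The tree search performed by such tactics can be illustrated outside Coq. The following is only a minimal Python sketch, assuming a hypothetical encoding of statements as nested tuples; the names `seq`, `rule_repeat`, `rule_fence_atomic` and `rewrite_first` are illustrative and are not part of the Coq development. It applies a purely syntactic rule (here Repeat or FenceAtomic from Figure 3) at the first matching subterm.

```python
# Hypothetical statement encoding (an assumption, not the Coq one):
#   ("skip",), ("fence",), ("seq", s1, s2), ("atomic", s),
#   ("repeat", s, c), ("loop", s), ("assume", c); conditions are strings or ("not", c).

def seq(*stmts):
    """Right-nested sequence of one or more statements."""
    head, *tail = stmts
    return head if not tail else ("seq", head, seq(*tail))

def rule_repeat(s):
    """Repeat: repeat s until c  becomes  loop (s; assume not c); s; assume c."""
    if s[0] == "repeat":
        _, body, c = s
        return seq(("loop", seq(body, ("assume", ("not", c)))), body, ("assume", c))
    return None

def rule_fence_atomic(s):
    """FenceAtomic: fence  becomes  atomic skip."""
    return ("atomic", ("skip",)) if s == ("fence",) else None

def rewrite_first(rule, s):
    """Apply `rule` at the first matching subterm (pre-order); None if nothing matches."""
    out = rule(s)
    if out is not None:
        return out
    for i, child in enumerate(s):
        if isinstance(child, tuple):
            new_child = rewrite_first(rule, child)
            if new_child is not None:
                return s[:i] + (new_child,) + s[i + 1:]
    return None
```

For instance, `rewrite_first(rule_repeat, seq(("fence",), ("repeat", ("skip",), "c")))` leaves the leading fence untouched and rewrites only the loop. Rules with preconditions (such as DeadCode) would additionally need an executable check before rewriting, which this sketch omits.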


To validate the efficacy of our refinement methodology for the verification of a managed concurrent
programming language such as Java, we have devised a block-structured Managed Intermediate
Representation (MIR), which we compile to RTL extended with the commands of X, and subsequently to
x86-TSO using the CompCertTSO tool chain. MIR exposes typical features found in a managed language
such as object allocation, field access, synchronization, etc., as well as high-level concurrency
primitives such as locks, threads, non-blocking stacks and garbage collection. MIR has been designed
to serve as a reasonable IR target for Java bytecodes.


The compiler is sufficiently complete to compile data-allocation intensive programs such as the
binary-trees benchmark.^1 Running this program shows that the collector effectively traces the
heap and collects free objects in parallel with user code.

4.1.1 Formalization


In this section, we present the semantics that justify our methodology. Figure 4 presents the
different relations (arrows) we use, and the way in which they synchronize. We start our discussion
with the semantics of X. We elide the semantics of the extended RTL, which is simply the semantics
of the RTL language of CompCertTSO with the additional commands of X. As mentioned before, only
terms in X_L, the low-level commands of X, are compiled into RTL. Terms in X_H need not have an
obvious implementation in RTL, and only serve to facilitate our proofs.

Our semantics are structured as the composition of different labeled transition systems. Figure 5
presents the events and small-step semantics of individual commands of the X language. Notice
that we have added placeholders {π}, standing for assertion predicates, to the syntax of load and
store instructions. These predicates are not used in the definition of the program semantics,
but are necessary to support the rely-guarantee proof methodology used to aid compositional
reasoning.

Our labels are composed of memory, synchronization, and error events. Memory events, MEv,
roughly correspond to the memory operations available in the x86 architecture. These include:
reads $rd_{p,v}$, representing a query of memory location p that returns value v; writes
$st_{p,v}$, representing a store of a value v to memory location p; compare-and-set events
$cas_{p,v,v',w}$, representing an atomic read-modify operation on memory location p, where v is
the expected value, v' is the value to be stored in p, and w is the result of the read (notice that
the update is executed only if v and w coincide); an event # recording the execution of a memory
fence; and a special event $ubf_{p,v}$ denoting the flush of a TSO buffer. The full set of events
Ev includes memory events as well as a τ (empty) event, corresponding to a thread-local operation
(we generally omit such labels); > and < events, representing the beginning and the end of an
atomic command, respectively; and an abort event, f, generated by the abort command to represent
exceptional execution.

The semantics of Figure 5 represents the contribution of each thread, through events, to the overall
system. Figure 6 shows the small-step semantics of the composition of different threads and their
interaction through shared memory. Recall that, based on the syntax of Figure 1, metavariables
r, o, n, d ∈ Registers represent registers, v ranges over values, and p represents a memory location.
We distinguish the sublanguage X_L of X by disallowing the high-level statements of X (i.e., assume,
loop, branch and atomic).

Thread-local evaluation is defined by a small-step evaluation judgment of the form
$s, rs \xrightarrow{ev} s', rs'$, where s and s' are commands in X, and rs, rs' ∈ RegMap are
register maps, associating values to the registers of the thread. We use the command skip to
represent termination. The notation rs[r : v] denotes a register map that associates v to the
register r. The judgment states that evaluating statement s with a register map rs yields a state
with continuation s' and a new register map rs', while emitting the event ev. Notice that when an
abort command is executed, the whole command is immediately terminated, with continuation skip and
abort event f. Since load and compare-and-set judgments are defined in isolation from the memory
judgments, but depend on the memory, their rules are non-deterministic. For example, a $rd_{p,v}$
step must admit every possible value v as its return value. The value is only constrained when
synchronizing with the memory, where only one value can be read. The property of accepting all
possible return values is called receptiveness in CompCertTSO [76], and our semantics uses the same
principle.

Statements fence and cas(d, r, o, n) emit the events # and $cas_{p,v,v',w}$, respectively, with the
obvious semantic rules. For the latter instruction, the memory location to be read and modified is
contained in the register r. Hence, if the register r contains a pointer p, the expected value for
the pointer, given in register o, is v, the value to write, in register n, is v', and the actual
value of p in memory is w, the instruction generates the event $cas_{p,v,v',w}$. The value w is
placed in the destination register d. In the case where v = w, the location p is updated to v';
otherwise it remains unchanged. Here also, the rule for $cas_{p,v,v',w}$ is receptive. Rules related
to local control flow emit τ events, whose labels we omit since their effect is not observable by
other threads.

Language: X

$s \in X ::= \text{skip} \mid d = op(\bar{r}) \mid s; s \mid \text{if } c \text{ then } s \text{ else } s \mid \text{repeat } s \text{ until } c$
$\qquad \mid \{\pi\}\ \text{load}_\nu(d, r) \mid \{\pi\}\ \text{store}_\nu(d, r) \mid \text{cas}(d, r, o, n) \mid \text{fence} \mid \text{abort}$
$\qquad \mid \text{atomic } s \mid \text{assume } c \mid \text{branch } s, s \mid \text{loop } s$
$\nu ::= \text{Local} \mid \epsilon$

Events: MEv, Ev

$e \in MEv ::= rd_{p,v} \mid st_{p,v} \mid cas_{p,v,v',w} \mid ubf_{p,v} \mid \#$
$ev \in Ev ::= e \mid \tau \mid\ >\ \mid\ <\ \mid f$

Step Evaluation: $(X \times RegMap) \xrightarrow{Ev} (X \times RegMap)$

$\text{load}_\nu(d, r),\ rs[r : p] \xrightarrow{rd_{p,v}} \text{skip},\ rs[d \leftarrow v]$
$\text{store}_\nu(d, r),\ rs[d : p, r : v] \xrightarrow{st_{p,v}} \text{skip},\ rs$
$\text{cas}(d, r, o, n),\ rs[r : p, o : v, n : v'] \xrightarrow{cas_{p,v,v',w}} \text{skip},\ rs[d \leftarrow w]$
$d = op(\bar{r}),\ rs \rightarrow \text{skip},\ rs[d \leftarrow \mathcal{O}(op, \bar{r})]$
$\text{skip}; s,\ rs \rightarrow s,\ rs$
$\text{if } c \text{ then } s_1 \text{ else } s_2,\ rs \rightarrow s_1,\ rs \quad \text{if } \mathcal{C}(c, rs)$
$\text{if } c \text{ then } s_1 \text{ else } s_2,\ rs \rightarrow s_2,\ rs \quad \text{if } \neg\mathcal{C}(c, rs)$
$\text{repeat } s \text{ until } c,\ rs \rightarrow s; \text{if } c \text{ then skip else repeat } s \text{ until } c,\ rs$
$\text{fence},\ rs \xrightarrow{\#} \text{skip},\ rs$
$\text{loop } s,\ rs \rightarrow (s; \text{loop } s),\ rs$
$\text{loop } s,\ rs \rightarrow \text{skip},\ rs$
$\text{branch } s_1, s_2,\ rs \rightarrow s_i,\ rs \quad i \in \{1, 2\}$
$\text{assume } c,\ rs \rightarrow \text{skip},\ rs \quad \text{if } \mathcal{C}(c, rs)$
$\text{atomic } s,\ rs \xrightarrow{>} s; \text{endatomic},\ rs$
$\text{endatomic},\ rs \xrightarrow{<} \text{skip},\ rs$
$\text{abort},\ rs \xrightarrow{f} \text{skip},\ rs$

$\dfrac{s_0,\ rs \xrightarrow{ev} s_0',\ rs' \quad ev \neq f}{(s_0; s_1),\ rs \xrightarrow{ev} (s_0'; s_1),\ rs'}$
$\qquad \dfrac{s_0,\ rs \xrightarrow{f} s_0',\ rs'}{(s_0; s_1),\ rs \xrightarrow{f} \text{skip},\ rs'}$

Figure 5: Events and thread-local semantics of X.

1 http://shootout.alioth.debian.org/


The command loop s nondeterministically chooses to either execute the statement s and continue
looping, or terminate immediately. The statement branch s1, s2 nondeterministically chooses to
execute s1 or s2. The command assume c only proceeds if the register map rs satisfies the condition
c. The atomic s command executes s atomically, ensuring that the effect of the atomic action
is propagated to memory from the local store buffer upon completion; endatomic is a runtime
statement used simply as a marker to record the end of an atomic section. It is not part of the
source code syntax.
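
To make the thread-local rules concrete, here is a minimal Python sketch of a step function for a fragment of X. It is an illustration under stated assumptions rather than the mechanized semantics: statements are nested tuples, conditions are Python callables over the register map, the value domain for receptive loads is the hypothetical parameter `values`, and the event names `"tau"` and `"abort"` stand for the τ and abort events. The function returns every possible step, so nondeterminism (loop, branch, receptive loads) shows up as several results and a blocked assume as none.

```python
def steps(s, rs, values=(0, 1)):
    """All thread-local steps (event, s', rs') from statement s and register map rs."""
    kind = s[0]
    if kind == "load":                      # receptive: any value v may be read
        _, d, r = s
        return [(("rd", rs[r], v), ("skip",), {**rs, d: v}) for v in values]
    if kind == "store":                     # d holds the pointer, r the value
        _, d, r = s
        return [(("st", rs[d], rs[r]), ("skip",), rs)]
    if kind == "fence":
        return [("#", ("skip",), rs)]
    if kind == "loop":                      # exit, or run the body once more
        return [("tau", ("skip",), rs), ("tau", ("seq", s[1], s), rs)]
    if kind == "branch":
        return [("tau", s[1], rs), ("tau", s[2], rs)]
    if kind == "assume":                    # no step at all when c is false
        return [("tau", ("skip",), rs)] if s[1](rs) else []
    if kind == "atomic":
        return [(">", ("seq", s[1], ("endatomic",)), rs)]
    if kind == "endatomic":
        return [("<", ("skip",), rs)]
    if kind == "abort":
        return [("abort", ("skip",), rs)]
    if kind == "seq":
        _, s0, s1 = s
        if s0 == ("skip",):
            return [("tau", s1, rs)]
        out = []
        for ev, s0p, rsp in steps(s0, rs, values):
            # an abort discards the rest of the sequence
            out.append((ev, ("skip",) if ev == "abort" else ("seq", s0p, s1), rsp))
        return out
    return []                               # skip: terminated
```

For example, `steps(("loop", ("fence",)), {})` yields two results (exit, or unfold one iteration), while `steps(("load", "d", "r"), {"r": "p"})` yields one read event per candidate value; the memory machine is what later selects a single one.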


In Figure 6 we present the semantics of thread composition, stratified into two parts: (1) the
semantics of the memory machine, and (2) the overall system behavior composing the memory and
the threads. The memory machine implements the TSO memory model following the guidelines
of CompCertTSO. The memory state, which shall remain abstract throughout the paper, contains
a store, mapping memory locations to values, and a write buffer for each thread. A write buffer is
simply a FIFO queue of store events of the form (p, v) (a pending store of a value v at address p).
Given a memory M ∈ Mem, we use projections M.m and M.b to obtain the store and the buffer
map, respectively; M.b(t) represents the buffer of thread t, and the operations
bufferPush(M, t, (p, v)), bufferPop(M, t), updateMem(M, (p, v)) and emptyBuff have the obvious
meanings, where M is a memory state and t a thread. lastIn(B, p) returns the value of the last-in
item in the store buffer B for the location p. Finally, the operation CAS M p v v' = (w, M') returns
the pair containing the value w read in the memory M for pointer p and, accordingly, the new memory
M' (which differs from M if the operation was successful).


The semantics of the memory machine is described by judgments of the form
$M \xrightarrow{ev}_t M'$, which represent the execution of an event ev by thread t that modifies
the memory state M into the state M'. These rules closely follow the memory machines described in
[15, 76]; note that in the rule for reading, we use the notation $lastIn(M.b(t), p)_{M.m(p)}$ to
indicate that the absence of location p in the store buffer for thread t, i.e.,
$p \notin dom(M.b(t))$, results in reading the contents of p from memory: M.m(p). Note that
unbuffering is the only memory operation that is not derived from the program syntax; it can be
applied at any time when the thread buffer is non-empty, and flushes some unspecified portion of
the buffer to memory.
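
The memory machine can be pictured with a small executable model. The sketch below is an assumption-laden illustration in Python, not the Coq definition: the class name `TSOMemory` and its method names are hypothetical, and `unbuffer` commits one pending write per call (repeated calls flush any prefix of the buffer). It mirrors the operations described above: reads consult the thread's own buffer before memory, writes are only enqueued, and fences and compare-and-set require an empty buffer.

```python
from collections import deque

class TSOMemory:
    """A store plus one FIFO write buffer per thread (illustrative names)."""

    def __init__(self, store, threads):
        self.m = dict(store)                       # M.m : location -> value
        self.b = {t: deque() for t in threads}     # M.b : thread -> FIFO of (p, v)

    def read(self, t, p):
        """Newest buffered value of p for thread t, else the value in memory."""
        for q, v in reversed(self.b[t]):
            if q == p:
                return v
        return self.m[p]

    def write(self, t, p, v):
        """A store only enqueues a pending write in t's buffer."""
        self.b[t].append((p, v))

    def unbuffer(self, t):
        """Commit the oldest pending write of t to memory; False if the buffer is empty."""
        if not self.b[t]:
            return False
        p, v = self.b[t].popleft()
        self.m[p] = v
        return True

    def fence(self, t):
        """A fence may only fire once t's buffer is empty."""
        return not self.b[t]

    def cas(self, t, p, expected, new):
        """Requires an empty buffer; returns the value read and updates p on success."""
        assert not self.b[t], "cas requires an empty write buffer"
        w = self.m[p]
        if w == expected:
            self.m[p] = new
        return w
```

With two threads that each write one location and then read the other's, both reads can still return the initial value as long as neither buffer has been flushed, which is the store-buffering relaxation characteristic of TSO.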


The state of the whole system is comprised of two components, a global memory and a thread map
(Π), which maps thread identifiers to thread states; these states contain the registers and the code
of the thread. There are two judgments in this semantics. Judgments of the form
$(M, st) \xrightarrow{ev}_t (M', st')$, where st is the thread state of thread t, containing t's
continuation and register map, represent the execution of a step by t with respect to shared
memory. The rule Memory Step synchronizes the semantics of individual threads and the memory system
by having the events in the premises and in the consequent coincide. The rule Intra Step does not
need to exercise the memory machine, and the rule Unbuffer asynchronously flushes elements of the
buffer into the memory without modifying the thread state.

Memory: $Mem \xrightarrow{MEv}_{Tid} Mem$ with $Mem = (Refs \to Value) \times (Tid \to Buf)$

$\dfrac{v = lastIn(M.b(t), p)_{M.m(p)}}{M \xrightarrow{rd_{p,v}}_t M}$
$\qquad \dfrac{M' = bufferPush(M, t, (p, v))}{M \xrightarrow{st_{p,v}}_t M'}$

$\dfrac{CAS\ M\ p\ v\ v' = (w, M') \quad M.b(t) = emptyBuff}{M \xrightarrow{cas_{p,v,v',w}}_t M'}$
$\qquad \dfrac{M.b(t) = emptyBuff}{M \xrightarrow{\#}_t M}$

$\dfrac{bufferPop(M, t) = ((p, v), M') \wedge M'' = updateMem(M', (p, v))}{M \xrightarrow{ubf_{p,v}}_t M''}$

Memory Composition: $(Mem \times ThrdSt) \xrightarrow{Ev}_{Tid} (Mem \times ThrdSt)$

Memory Step:  $\dfrac{st \xrightarrow{ev} st' \quad M \xrightarrow{ev}_t M'}{(M, st) \xrightarrow{ev}_t (M', st')}$
Intra Step:   $\dfrac{st \rightarrow st'}{(M, st) \rightarrow_t (M, st')}$
Unbuffer:     $\dfrac{M \xrightarrow{ubf_{p,v}}_t M'}{(M, st) \xrightarrow{ubf_{p,v}}_t (M', st)}$

Thread Composition: $(Mem \times ThrdMap) \Rightarrow_{Tid} (Mem \times ThrdMap)$

Interleave Nonatomic:  $\dfrac{(M, \Pi(t)) \xrightarrow{ev}_t (M', st') \quad ev \neq\ >}{(M, \Pi) \Rightarrow_t (M', \Pi[t \leftarrow st'])}$

Interleave Atomic:  $\dfrac{\Pi(t) \xrightarrow{>} st' \quad (M, \Pi[t \leftarrow st']) \Rightarrow_t^{*} (M', \Pi') \quad M'.b(t) = emptyBuff \quad \Pi'(t) \xrightarrow{<} st''}{(M, \Pi) \Rightarrow_t (M', \Pi'[t \leftarrow st''])}$

Figure 6: Memory and Thread Composition Semantics

Thread composition judgments have the shape $(M, \Pi) \Rightarrow_t (M', \Pi')$. This semantics
captures, in a single step, the multiple steps that could be required to execute an atomic
statement. The rule Interleave Nonatomic executes any statement labelled with an event other than >.
The rule Interleave Atomic executes the atomic statement in a single step, thereby ensuring that
all actions in the atomic statement occur without interleaving of other threads; observe that the
thread identifier in the premise restricts the multistep in the premise to only execute steps of
thread t.


4.2  Reconciling  Language  and  Processor  Memory  Models 


A decade ago, the semantics of concurrent Java programs, the JMM, was revised and redefined [57].
This revision, which was adopted as part of the official Java specification [46], had multiple
purposes. First, it was intended to replace the previous specification, which disallowed many common
architectural and compiler optimizations of Java programs that were found in many state-of-the-art
implementations. Second, it formalized, using a rather complicated axiomatic semantics, the
possible behaviors of concurrent Java programs. Its formalization, the Data Race Freedom (DRF)
guarantee, established that programs that do not have data races (i.e., are data-race free) in
their sequentially consistent (SC) semantics can only exhibit SC behavior, even when executed
on non-SC hardware [7]. Unfortunately, due to the complexity of the formalism, many desirable
properties of the semantics were not met, and many undesirable properties were not prevented [74].
In light of these shortcomings, there is an ongoing community effort to better understand and
reconsider the definition of the JMM [43].


A testament to the complexity of the JMM specification is The JSR-133 Cookbook for Compiler
Writers [49], an informal guide to implementing the JMM on different computer architectures. This
document is intended to help Java compiler writers provide safe, reasonably efficient
implementations that nonetheless satisfy the JMM requirements. Unlike in the JMM, the high-level
semantics of Java concurrency is described there operationally, in terms of memory instruction
reorderings, thus defining the relaxed behaviors a program may exhibit in a form suitable for
reasoning about the correctness of compiler optimizations.

One of the reasons why the current JMM specification is so complex is that it attempts to uniformly
capture the set of memory relaxations induced both by relaxed-memory platforms and by common
compiler optimizations deemed necessary to provide performant Java implementations. A recent
effort [17] has considered an alternative approach, namely giving a semantics to Java that
captures only the relaxations permitted by the TSO memory model found on x86 architectures [63].
One could attempt to implement this flavor of Java on weaker architectures such as IBM’s
Power [72] platform, but this is a substantially more challenging exercise; simply retrofitting the
TSO-aware semantics developed in [17] for Power would incur a high performance cost, necessitating
the injection of low-level synchronization operations between normal variable memory accesses
to ensure TSO behavior.


The following question thus presents itself: what is the strongest memory model that would
(1) be efficiently implementable, i.e., not require synchronization at the low level for
non-volatile variables, on architectures as relaxed as Power, and (2) yet have a tractable formal
semantics amenable to the rigorous proofs needed to demonstrate compiler correctness à la
CompCertTSO [76]? As a corollary, we also wished to understand the semantics of current JVM
implementations with respect to the memory model they support. JVMs ensure their implementations
are consistent with the JMM by making conservative decisions on synchronization and shared-memory
accesses. Our interest was in determining whether there is a middle ground between the behaviors
admitted by relaxed-memory architectures and the JMM that provides a more tractable,
perhaps stronger, semantics than the JMM, but that nonetheless enables compilers to provide
acceptable performance for modern Java applications.


At first glance, it would appear that many of these questions were answered in [49]. However,
given that [49] is an informal document, with no clear (let alone formal) semantic definitions
and no guarantees that the rules defined are correct, our research focused on a methodology to
formalize the semantics induced by its “recipes”, deriving, as an important by-product, a provable
validation that some of the minimal guarantees required by the JMM are satisfied. In this sense,
our goals were broadly similar to those of [8], which provides a provably correct compilation
strategy of C++11 into Power. However, operating as we do in the Java context, our challenges were
substantially different: not only must our formalization cope uniformly with different
architectures, given the platform-agnostic definition of the JMM, but it must also deal explicitly
with a number of JMM-specific features, such as its support for “roach-motel” reorderings,
explicitly established as a requirement of the JMM [57]. These issues make it infeasible to
seamlessly transplant the results from approaches like [8]. Unlike [8], we do not provide a concrete
compilation strategy (indicating, for example, that a fence has to be emitted immediately after a
volatile store), but rather indicate minimal constraints that must be satisfied by any such
strategy (for example, that a fence must exist between a volatile store and any subsequent memory
action). We did this to retain the flexibility to capture systems like Octet [10], where the fences
might be added at garbage collection safe points, for example. This follows the spirit of [49].


Perhaps surprisingly, the relation between [57] and [49] had not been considered formally before,
and notably our results show that the rules implied by [49] for Power were at odds with the
requirements of the JMM. Concretely, while working on our proofs we found a counter-example to
the DRF requirement of the JMM if the rules of [49] were used for Power. The example in question
is the infamous IRIW (independent reads of independent writes) litmus test, reproduced below,
considering only volatile variables instead of normal variables. In Java, concurrent conflicting
accesses to volatile variables are not considered to form a data race. We display the example below
with each thread in a column, and we assume that the object o is shared among all threads, with
volatile fields v and w. Variables starting with r are local to each thread.

APPROVED FOR PUBLIC RELEASE; DISTRIBUTION UNLIMITED


    o.v = o.w = 0 (both fields are volatile)

    o.v = 1;   ||   o.w = 1;   ||   r0 = o.v;   ||   r2 = o.w;
               ||              ||   r1 = o.w;   ||   r3 = o.v;

    Is r0 = r2 = 1 & r1 = r3 = 0 allowed?

The behavior in question cannot be produced under a sequentially consistent semantics. However,
this behavior is possible in Power [72]. Moreover, inserting lwsync Power barriers between the
two reads in the reading threads would not prevent this behavior from happening, as documented
in [13, 72].^2 Unfortunately, lwsync was the barrier recommended by [49] to prevent this relaxation
when our work was started.^3 We tried this Java example on a Power 7 machine, and
were able to reproduce the erroneous behavior in the two different JVMs we tested,^4 indicating
that this is not simply a theoretical inconvenience, but a critical dichotomy between desired
semantics and implementations. Our discussions with several VM implementors indicated that (a) the
cookbook was heavily used as a crucial reference, given the complexity of the official specification,
and (b) some implementors were actually aware of the bug noted above, while others were not.
Given the subtlety and complexity of the JMM, and the lack of consensus among implementors on
a proper implementation strategy, the anecdotal evidence made clear that a cookbook-like
document is quite necessary, and a provably correct version even more so. To highlight the subtlety
of the issues involved, parts of the cookbook were in fact changed [8] in response to advances
in the formalization of processor memory models (e.g., [56, 72]), but in the absence of a formal
definition, those changes did not remediate the issues noted here.
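
The claim that the outcome is impossible under sequential consistency can be checked by brute force. The following Python sketch is an illustration only, with a hypothetical encoding of the four threads as lists of operations: it enumerates every interleaving that respects program order, executes it against a single shared memory, and collects the resulting values of (r0, r1, r2, r3).

```python
from itertools import permutations

# Each thread is a list of operations: ("st", field, value) or ("ld", field, register).
THREADS = [
    [("st", "v", 1)],
    [("st", "w", 1)],
    [("ld", "v", "r0"), ("ld", "w", "r1")],
    [("ld", "w", "r2"), ("ld", "v", "r3")],
]

def sc_outcomes(threads):
    """Register outcomes (r0, r1, r2, r3) of every interleaving respecting program order."""
    tags = [t for t, ops in enumerate(threads) for _ in ops]
    outcomes = set()
    for schedule in set(permutations(tags)):
        mem, regs, pc = {"v": 0, "w": 0}, {}, [0] * len(threads)
        for t in schedule:
            op = threads[t][pc[t]]      # next operation of thread t in program order
            pc[t] += 1
            if op[0] == "st":
                mem[op[1]] = op[2]
            else:
                regs[op[2]] = mem[op[1]]
        outcomes.add(tuple(regs[r] for r in ("r0", "r1", "r2", "r3")))
    return outcomes
```

The tuple `(1, 0, 1, 0)`, i.e. r0 = r2 = 1 and r1 = r3 = 0, never appears in `sc_outcomes(THREADS)`: it would require the two readers to observe the two writes in opposite orders, which no single interleaving can produce.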

In light of these issues, we provided the first (operational) formalization of the semantics of
compiling concurrency features in Java, as described by [49], into the x86 and Power relaxed-memory
architectures. Notably, our high-level semantics propagates the relaxations admitted by Power to
normal Java variables. Our choice to propagate Power semantics for normal variables into a
high-level semantics is motivated by the fact that any stronger semantics at the high level would
impose synchronization operations for normal variables on Power, one of the weakest processor
architectures currently available. This would most likely greatly degrade the performance of
concurrent Java programs on that platform, which is on the one hand unnecessary given the JMM
definition, and on the other hand not required by [49]. We considered this to be a minimal
performance requirement for any acceptably efficient implementation of the JMM on Power. Given that
Power is one of the weakest architectural memory models yet studied, we view our high-level
semantics as an upper bound on how strong a JMM could be without penalizing weak architectures like
Power. [49] uses an intermediate representation to express memory operation reorderings. We
formalized this intermediate representation, and proved a simulation argument between source-level
programs and programs compiled to this IR, establishing an inclusion property between behaviors
allowed by the target architectures (x86 and Power) and this IR. We additionally formalized the
different target architectures we considered in the same framework, and where the rules of [49]
are correct, proved that they are so. Additionally, we identified the rules that do not produce
correct implementations, and proposed corrections, which we then proved sufficient to enforce the
expected high-level semantics (e.g., volatile variables must exhibit SC semantics). Our findings
have been propagated to the current revision of [49]. These results provide the first formalization
that relates the high-level semantics of the JMM to low-level architectural implementations as
described in [49].

2 The behavior manifests because lwsync imposes no constraints on when the stores performed by the
first two threads become visible to the readers.

3 After our results were published, the cookbook was updated based on our findings.

4 The example failed on IBM’s JVM and Jikes RVM. Similar examples failed in Fiji’s realtime JVM
implementation on ARM 7.

Table 1: High-level Roach-Motel Semantics Rules

1st Op. \ 2nd Op.        | Normal Load / Store | Volatile Load / Lock | Volatile Store / Unlock
Normal Load / Store      |                     |                      | No
Volatile Load / Lock     | No                  | No                   | No
Volatile Store / Unlock  |                     | No                   | No


4.2.1  Methodology  Details 

Consider the requirements of the JMM with respect to the implementation of synchronization
operations, and their relation to the rules provided by the cookbook document. A driving principle
of the JMM, dubbed the roach motel semantics [57], is that increasing the synchronization of a
program cannot add new observable behaviors to it. The synchronization operations, formally defined
in [57], include locking and volatile memory access operations.^5 The roach motel principle implies
that all program transformations that increase the happens-before [48] relation of the program,
which captures the causality relation of a program enforced through its synchronization actions
(locks and volatile accesses), should be allowed by the memory model. Pragmatically, this means
that normal memory operations following a volatile write can be reordered before it, since the
resulting program imposes additional synchronization not required by the former. Similarly, normal
memory operations preceding a volatile read can be reordered after it. An argument similar to
the case of volatile writes applies to unlock operations (a monitorexit in Java bytecode), and the
same is true for volatile reads with respect to lock operations (monitorenter). These observations
justify the first table presented in the cookbook [49], which describes the reorderings possible at
the highest level considered in that document. We reproduce this in Table 1. The table indicates
that two operations can be reordered if the cell is empty, and that they cannot if the cell is
marked “No”; the first operation is sampled from the rows and the second one from the columns. Data
and control dependencies are assumed to be respected by the cookbook tables. Thus, for instance, two
normal memory operations on different references can be freely reordered, but any two
synchronization operations cannot.
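
As a small illustration, the content of Table 1 can be written down as a lookup. The Python sketch below is not part of the formalization; the constant names `NORMAL`, `ACQUIRE` and `RELEASE` are hypothetical labels for the three operation classes of the table, and data and control dependencies are ignored, as in the table itself.

```python
# Operation classes of Table 1 (the constant names are illustrative labels).
NORMAL = "normal load / store"
ACQUIRE = "volatile load / lock"
RELEASE = "volatile store / unlock"

# Cells marked "No" in Table 1, as (first operation, second operation) pairs.
FORBIDDEN = {
    (NORMAL, RELEASE),
    (ACQUIRE, NORMAL), (ACQUIRE, ACQUIRE), (ACQUIRE, RELEASE),
    (RELEASE, ACQUIRE), (RELEASE, RELEASE),
}

def can_reorder(first, second):
    """True if Table 1 leaves the cell empty, i.e. the two operations may be swapped."""
    return (first, second) not in FORBIDDEN
```

The two roach-motel cases discussed above correspond to `can_reorder(RELEASE, NORMAL)` and `can_reorder(NORMAL, ACQUIRE)`, both of which hold, whereas moving a normal access out of the synchronized region, as in `can_reorder(NORMAL, RELEASE)` or `can_reorder(ACQUIRE, NORMAL)`, is rejected.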


5 Thread creation, termination, and object initialization are also synchronization operations, but
they are not relevant for the ideas discussed here.

Intermediate Representation. Before presenting the requirements for the implementation of
these operations for a specific architecture, the cookbook introduces an intermediate low-level
representation in which memory operations are not assumed to have inherent ordering semantics;
instead, operation ordering is imposed through additional barrier (or fence) instructions that
restrict the reorderings permissible between two memory accesses. At this level, volatile memory
operations are assumed to be "implemented" using normal memory operations (corresponding to the
operations provided by the ISA of the target architecture), and the ordering constraints of Table 1
have to be enforced rather than assumed. This intermediate representation assumes that, for each
pair of kinds of memory operations, there is a distinct barrier that prevents two such operations
from being reordered when the code emits it between the two accesses. For example, two read
operations cannot be reordered if the thread emits a Load-to-Load barrier (LoadLoad) between
them. Similar fences exist between stores and loads, loads and stores, and two consecutive stores.
Table 2 presents the barriers that must be introduced in this intermediate representation to
enforce the semantics of Java delineated by Table 1. This is the second table of [49].

Table 2: Low-level Cookbook: Barriers Required

1st Op. \ 2nd Op.      | Normal Load | Normal Store | Volatile Load/Lock | Volatile Store/Unlock
-----------------------+-------------+--------------+--------------------+----------------------
Normal Load            |             |              |                    | LoadStore
Normal Store           |             |              |                    | StoreStore
Volatile Load/Lock     | LoadLoad    | LoadStore    | LoadLoad           | LoadStore
Volatile Store/Unlock  |             |              | StoreLoad          | StoreStore
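
As a rough illustration of how the barrier table is meant to be read, the following Python sketch encodes it as a lookup and checks whether a straight-line sequence of operations is sufficiently fenced. The operation names ("load", "store", "vload", "vstore") and the functions required_barrier and is_correctly_fenced are hypothetical names chosen for this sketch; they do not come from the cookbook or from our development.

```python
# Barrier required between a first and a later second operation (Table 2).
# "load"/"store" are normal accesses; "vload"/"vstore" stand for
# Volatile Load/Lock and Volatile Store/Unlock (names assumed for this sketch).
BARRIERS = {
    ("load", "vstore"): "LoadStore",
    ("store", "vstore"): "StoreStore",
    ("vload", "load"): "LoadLoad",
    ("vload", "store"): "LoadStore",
    ("vload", "vload"): "LoadLoad",
    ("vload", "vstore"): "LoadStore",
    ("vstore", "vload"): "StoreLoad",
    ("vstore", "vstore"): "StoreStore",
}
MEMORY_OPS = {"load", "store", "vload", "vstore"}


def required_barrier(first, second):
    """Return the barrier kind Table 2 demands, or None for an empty cell."""
    return BARRIERS.get((first, second))


def is_correctly_fenced(trace):
    """Check that every ordered pair of accesses has its barrier in between."""
    for i, first in enumerate(trace):
        if first not in MEMORY_OPS:
            continue
        for j in range(i + 1, len(trace)):
            second = trace[j]
            if second not in MEMORY_OPS:
                continue
            need = required_barrier(first, second)
            if need is not None and need not in trace[i + 1:j]:
                return False
    return True
```

For instance, the sequence store, load, LoadStore, vstore is rejected by this check: the normal store and the later volatile store still need a StoreStore barrier somewhere between them.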

Given the lack of a precise semantics for normal load and store instructions, it is difficult to
formally establish the correspondence between the high- and low-level versions. A major
contribution of our work was the definition of a tractable semantics for these two layers that
enables the correctness proof of the rules relating these two tables.


Store-Atomicity Relaxation. A limitation of the cookbook document is that its argumentation
is made in terms of operation reorderings, which disregards relaxations of store atomicity (or
write atomicity). Such relaxations allow a write to be propagated to different threads at different
times, and are permitted by some architectures, including Power and ARM [5, 72]. One could imagine
providing a semantics which considers reordering of operations as the only source of relaxations,
in the style of the TSO, Partial Store Ordering (PSO) and Relaxed Memory Ordering (RMO) [79]
memory models. However, this would be insufficient to capture certain important relaxations that
are permitted by architectures with weaker memory models; the following example (similar to the
example Write-Read Conflict (WRC) of [72]) illustrates this issue.


                      o.f = o'.f = NULL

    o.f = o'   ||   (o.f).f = o   ||   r0 = o'.f;
                                       r1 = r0.f              (1)

                      r0 = o ∧ r1 = NULL?


This program has three threads, which share two objects o and o', each with a single field f
initially NULL. We assume that the type of the field f is the same as the type of o and o'. In the
result indicated at the end, we have that r0 = o; therefore it must be the case that the read of
o'.f in the third thread returns the object o. Indeed, this is possible if the first thread
executes first, then the second thread dereferences o.f, obtaining o', and after that writes o into
o'.f. Now the read into r0 in the third thread can be fulfilled. Clearly, the read of r0.f in the
third thread cannot happen before r0 has obtained its value through the previous read; therefore,
these two reads cannot be reordered. In that case, if the only source of relaxation is reordering,
the read of r0.f, which is in fact a read of o.f, must see the value o', since all reorderings are
prevented by data dependencies. The final result therefore cannot be produced by a reordering-only
memory model. However, it is a possible behavior in Power: relaxing write atomicity means that
the write of the first thread may be propagated to the second thread but not yet to the third,
allowing the third thread to read NULL into r1. To admit such behavior, and thereby avoid
over-synchronizing normal memory accesses, the write-atomicity relaxations of Power must be
introduced into the (low-level) cookbook semantics.
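
The argument above can be replayed on a toy model. The Python sketch below exhaustively explores example (1) with every thread executing in program order, once with writes that become visible to all threads at the same time and once with writes that are propagated to each thread separately. It is a deliberately simplified model written for this illustration (per-thread views, no barriers, no coherence rules), not the Power semantics used in our proofs; the names PROGRAM and outcomes and the register a are assumptions of the sketch.

```python
NULL = None

# Each thread is a list of instructions over objects with a single field f:
#   ("store", base, obj): base.f = obj
#   ("load", reg, base):  reg = base.f
# `base` names either an object or a register that holds a reference.
PROGRAM = [
    [("store", "o", "o'")],                        # thread 1: o.f = o'
    [("load", "a", "o"), ("store", "a", "o")],     # thread 2: (o.f).f = o
    [("load", "r0", "o'"), ("load", "r1", "r0")],  # thread 3: r0 = o'.f; r1 = r0.f
]
N = len(PROGRAM)


def outcomes(write_atomic):
    """Collect every final (r0, r1) of thread 3, with no instruction reordering."""
    results = set()

    def run(pcs, views, regs, pending):
        if all(pc >= len(PROGRAM[t]) for t, pc in enumerate(pcs)):
            results.add((regs[2].get("r0"), regs[2].get("r1")))
            return
        # Either propagate one pending write to the thread it is addressed to...
        for i, (target, obj, val) in enumerate(pending):
            new_views = [dict(v) for v in views]
            new_views[target][obj] = val
            run(pcs, new_views, regs, pending[:i] + pending[i + 1:])
        # ...or let one thread execute its next instruction.
        for t, pc in enumerate(pcs):
            if pc >= len(PROGRAM[t]):
                continue
            kind, x, y = PROGRAM[t][pc]
            new_views = [dict(v) for v in views]
            new_regs = [dict(r) for r in regs]
            new_pending = pending
            next_pc = pc + 1
            base = regs[t].get(y, y) if kind == "load" else regs[t].get(x, x)
            if base is NULL:
                next_pc = len(PROGRAM[t])  # null dereference: the thread stops
            elif kind == "load":
                new_regs[t][x] = views[t][base]
            else:
                new_views[t][base] = y  # a thread always sees its own write
                others = [u for u in range(N) if u != t]
                if write_atomic:
                    for u in others:
                        new_views[u][base] = y
                else:
                    new_pending = pending + tuple((u, base, y) for u in others)
            new_pcs = pcs[:t] + (next_pc,) + pcs[t + 1:]
            run(new_pcs, new_views, new_regs, new_pending)

    init = {"o": NULL, "o'": NULL}
    run((0,) * N, [dict(init) for _ in range(N)], [{} for _ in range(N)], ())
    return results
```

In this model, the outcome r0 = o with r1 = NULL is absent from outcomes(True) and present in outcomes(False), which mirrors the claim that reordering alone cannot produce it.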


Proof Structure. Figure 7 illustrates the overall proof structure that we have developed. At the
top level, we have the semantics of the JMM as described in [57], or rather the improved version
of [74]. Below this level, we have a high-level, architecture-agnostic, operational semantics which
adopts Power semantics for normal variables, and sequentially consistent semantics for volatile
variables and locks. We denote this semantics by cookbook-high. One level down, we have the
intermediate representation that contains only normal memory accesses and barriers. Finally, at
the bottom of the figure we have the semantics of the Power and x86 architectures, of which Power
offers the more relaxed semantics. We establish a backwards simulation between the high- and
low-level definitions of the cookbook, show that the high-level cookbook semantics respects the
JMM, and show that our low-level cookbook definition properly captures the behaviors admitted by
x86 and Power.
4.3 Garbage Collection


Modern programming languages like ML, Java, and C# rely on garbage collection (GC) for the
automatic reclamation of memory no longer used by the application. The GC is considered to be
one of the most subtle parts of modern runtime systems, carefully engineered to minimize the
runtime overheads of the applications it supports. A family of garbage collection algorithms, named
on-the-fly garbage collectors [18], allows the detection of garbage and its reclamation to occur
concurrently with an application's threads. Such algorithms are notably difficult to implement,
test, and prove, and constitute a significant challenge for mechanized verification. Many
on-the-fly algorithms are inherently racy, and some algorithms never require application threads
(called mutators) to wait for the collector thread, which detects and frees unused memory. As part
of our research goals, we considered the mechanized verification of a state-of-the-art GC algorithm
in this landscape [20, 22], where no locks are required, i.e. it is lock-free.

This challenge has been identified and addressed in various settings [35, 36]. Our results
provide an independent proof, exploring a different proof method in the design space. First, the
backbone of the formalization is a new compiler intermediate representation, named RtIR, that
we have developed to implement the garbage collector. Our experience implementing on-the-fly
garbage collectors [66] indicates that the choice of programming abstractions is of paramount
importance in reasoning about and optimizing this kind of algorithm. This concern necessitates a
representation that makes the expression and proof of invariants tractable. Moreover, in this work,
we strive to make our proof well suited to the context of our larger research goals as described
above, aiming at the formal verification of a compiler for concurrent, managed languages.

Our intermediate representation has special support for the implementation of efficient runtime
mechanisms: (1) strong type guarantees; (2) abstract concurrent data structures; (3) high-level
iterators for the reflective inspection of objects, used to implement low-level services, e.g.
ensuring that the garbage collector visits every live object; (4) native support for threads; and
(5) native support for the root management of a concurrent garbage collector (each thread must be
able to iterate over the set of memory references it can access directly).

Another important characteristic of our approach is the dedicated rely-guarantee program logic
that accompanies our intermediate representation. While previous approaches [31, 32, 36] attack
the proof by means of an abstract state transition system, which requires a monolithic global
invariant to be established, we followed the well-established rely-guarantee (RG) [44] methodology.
RG is a major technique for proving the correctness of concurrent programs that provides explicit
thread-modular reasoning. In this setting, interferences between threads are described using binary
relations: relies and guarantees. Each thread is proved correct under the assumption that it is
interleaved with threads fulfilling a rely relation. The effect of the thread itself on the shared
memory must respect its guarantee relation. This guarantee must also be coherent with the relies
that the other threads assume. Being able to reason in a thread-modular way is key to a tractable
correctness proof because it avoids the need to explicitly consider all possible interleavings. We
prove the soundness of our RG logic, and develop a set of tactics that reduce the proof effort
required to discharge the invariants.
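
To make the role of the two relations concrete, here is a minimal Python sketch over a small finite state space. It checks that a step of one thread respects that thread's guarantee, and that this guarantee is coherent with (included in) the rely assumed by another thread. The state shape (x, y), the relations guarantee_a and rely_b, and the step function step_a are all hypothetical examples, not relations from our GC proof.

```python
from itertools import product

# Hypothetical shared state: a pair (x, y) of small counters.
STATES = list(product(range(3), range(3)))


def guarantee_a(s, t):
    """Thread A promises to never decrease x and to leave y untouched."""
    return t[0] >= s[0] and t[1] == s[1]


def rely_b(s, t):
    """Thread B only assumes that its environment never decreases x."""
    return t[0] >= s[0]


def included(r1, r2):
    """True when every step allowed by relation r1 is also allowed by r2."""
    return all(r2(s, t) for s in STATES for t in STATES if r1(s, t))


def step_a(s):
    """One concrete step of thread A: increment x, saturating at 2."""
    x, y = s
    return (min(x + 1, 2), y)


# A's code respects its guarantee, and that guarantee fits within B's rely.
a_respects_guarantee = all(guarantee_a(s, step_a(s)) for s in STATES)
coherent = included(guarantee_a, rely_b)
```

Both checks only look at one thread and one pair of relations at a time, which is the sense in which the reasoning is thread-modular: no interleaving of A and B is ever enumerated.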

Finally, we have developed an original incremental proof technique that we put in place to carry
out this large endeavor. Starting from the full GC implementation, we progressively annotate the
program in order to prove stronger and stronger invariants. At each level, dedicated specification
annotations and tactics allow us to refine and reuse what has been proven at the previous levels.

X, Y ∈ gvar     x, y ∈ lvar     t, m, C ∈ tid     f ∈ fid     rn ∈ list fid

cmd ∋ c ::= skip | assume e | x = e | c1 ; c2 | c1 ⊕ c2 | loop(c) | atomic c
          | x = alloc(rn) | free(x) | isFree?(x)
          | x = Y | X = e | load x(f, y) | store(x, f)
          | x.push(y) | x = y.empty?() | x = y.top() | x.pop()
          | X = y.copy()
          | foreach (x in l) do c od
          | foreachField (f of x) do c od | foreachObject x do c od
          | foreachRoot (x of t) do c od

Figure 8: Simplified Syntax of RtIR


Using the Coq proof assistant, we achieved the following formalizations: (1) the syntax, semantics
and soundness of an RG program logic for our intermediate representation; (2) a number of
tactics and structural lemmas to facilitate the so-called stability proofs required by the RG
methodology; (3) a realistic implementation of Domani et al.'s GC algorithm [22] in our
intermediate representation; and (4) an RG proof ensuring the correctness of the GC: the collector
never frees references accessible by the running threads.


4.3.1  The  RtIR  Intermediate  Representation 

Syntax. Figure 8 shows the syntax of RtIR. The language provides two kinds of variables:
global or shared variables that can be accessed by all threads, and local variables used for
thread-local computations. Expressions (e) are built from constants and local variables with the
usual arithmetic and boolean operators. Commands include standard instructions, such as skip,
assume e, and local variable update x = e, and classic combinators: sequencing, non-deterministic
choice (c1 ⊕ c2), and loops. The conditional (if e then c1 else c2) can be defined as
(assume e; c1) ⊕ (assume !e; c2), where we write !e for the boolean negation of e. While loops and
repeat-until loops can be encoded similarly. RtIR also provides atomic blocks (atomic c). In our
GC, we use atomic blocks only to add ghost code (code used only for the proof, not taking part in
the computation) and to model linearizable data structures. These atomic constructs can be refined
into low-level, fine-grained implementations using techniques such as the atomicity refinement
methodology discussed earlier.
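
The encoding of the conditional can be tried out on a toy interpreter. The Python sketch below gives a big-step, set-of-outcomes reading of skip, assume, local update, sequencing and choice, and builds if-then-else from them. It is only an illustration: RtIR itself has a small-step semantics formalized in Coq, and the tuple representation and the names run, if_then_else and absolute are assumptions of this sketch.

```python
def run(cmd, env):
    """Return every environment reachable by running cmd from env (big-step)."""
    kind = cmd[0]
    if kind == "skip":
        return [env]
    if kind == "assume":   # blocks (no outcome) when the test fails
        return [env] if cmd[1](env) else []
    if kind == "assign":   # local variable update x = e
        return [{**env, cmd[1]: cmd[2](env)}]
    if kind == "seq":      # c1 ; c2
        return [e2 for e1 in run(cmd[1], env) for e2 in run(cmd[2], e1)]
    if kind == "choice":   # non-deterministic choice between c1 and c2
        return run(cmd[1], env) + run(cmd[2], env)
    raise ValueError(f"unknown command: {kind}")


def if_then_else(e, c1, c2):
    """Encode the conditional as (assume e; c1) choice (assume !e; c2)."""
    negated = lambda env: not e(env)
    return ("choice",
            ("seq", ("assume", e), c1),
            ("seq", ("assume", negated), c2))


# Hypothetical example: y = |x|.
absolute = if_then_else(
    lambda env: env["x"] < 0,
    ("assign", "y", lambda env: -env["x"]),
    ("assign", "y", lambda env: env["x"]),
)
```

Because exactly one of the two assume commands succeeds in any given environment, the choice yields a single outcome, as a deterministic conditional would.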

Instruction  alloc(rn)  allocates  a  new  object  in  the  heap  by  extracting  a  fresh  reference  from  the 
freelist  -  a  pool  of  unused  references  -  and  initializing  all  of  its  fields  in  the  record  name  rn  to  their 
default  value.  Conversely,  free  puts  a  reference  back  into  the  freelist.  Instruction  isFree?  looks 
up  the  freelist  to  test  whether  a  reference  is  in  it.  We  use  these  memory  management  primitives  to 
implement  the  GC. 
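
A minimal Python model of these three primitives might look as follows. It assumes that address 0 plays the role of the null reference, that every field defaults to 0, that free also removes the object from the heap, and that allocation fails with an exception when the freelist is empty; none of these choices is stated in the text above, and the class name Memory is hypothetical.

```python
class Memory:
    """Toy heap with a freelist (a pool of unused references)."""

    def __init__(self, size):
        # Assumption: references are the integers 1..size, 0 stands for null.
        self.freelist = set(range(1, size + 1))
        self.heap = {}

    def alloc(self, rn, default=0):
        """Take a fresh reference from the freelist and initialize fields rn."""
        if not self.freelist:
            raise MemoryError("freelist is empty")  # assumed failure mode
        ref = self.freelist.pop()
        self.heap[ref] = {field: default for field in rn}
        return ref

    def free(self, ref):
        """Put a reference back into the freelist."""
        self.heap.pop(ref, None)  # assumption: the object is dropped on free
        self.freelist.add(ref)

    def is_free(self, ref):
        """Model of isFree?: test whether ref is currently in the freelist."""
        return ref in self.freelist
```

A collector written against this interface would call free during its sweeping phase, which is why the correctness statement is phrased as "the collector never frees references accessible by the running threads".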
In RtIR, basic instructions related to shared-memory accesses are fine-grained, i.e. they
perform exactly one global operation (either a read or a write). These include loads and stores to
global variables and field loads and updates. This allows us, when conducting the proofs, to
consider each possible interleaving of memory operations arising from different threads, while
keeping the semantics reasonably simple. Apart from these basic memory accesses, RtIR provides
abstract concurrent queues which implement the mark buffers of [22], accessible through the
standard operations y = x.top(), x.pop(), x.push(y), and x = y.empty?(). These buffers are
necessary for the implementation of the GC. While we could have implemented these data structures
directly in RtIR, we realized that the proof burden would be significantly alleviated by reasoning
at a higher level, and hence chose to assume that they behave atomically. Mark buffers also provide
an operation X = y.copy(), which performs a deep copy and is only used in ghost code.


A salient ingredient of RtIR is its native support for iterators, enabling easy expression of many
GC bookkeeping tasks. The iterator foreach (x in l) do c od, where the variable x can be free in
command c, iterates c through all elements x of the static list l. Some more sophisticated
bookkeeping tasks include the visiting of all the fields of a given object, the marking of each of
the roots (references bound to local variables) of mutators, or the visiting of every object in the
heap (performed during the sweeping phase). In those cases, the list of elements to be iterated
upon is not known statically, so we provide dedicated iterators. The iterator
foreachField (f of x) do c od iterates c on all the fields f of the object stored in x. Command
foreachRoot (r of t) do c od iterates over the roots of mutator thread t, while
foreachObject x do c od iterates over all objects. We stress the fact that iterators have a
fine-grained behavior: the body command c executes in a small-step fashion.
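
The fine-grained behavior of iterators can be pictured with a Python generator that hands control back after each iteration, so that other threads may act in between. This is only an analogy for the small-step semantics: the function foreach_field, the dictionary-based heap and the field names are hypothetical, and the sketch pauses once per iteration rather than once per step of the body.

```python
def foreach_field(heap, ref, body):
    """Run body on each field of heap[ref], pausing after every iteration."""
    for field in list(heap[ref]):
        body(field, heap[ref][field])  # the field is read when its turn comes
        yield                          # scheduling point: other threads may run


heap = {1: {"left": 10, "right": 20}}  # hypothetical object with two fields
seen = []
walker = foreach_field(heap, 1, lambda f, v: seen.append((f, v)))

next(walker)            # the collector visits "left"
heap[1]["right"] = 99   # a mutator updates "right" in between two iterations
next(walker)            # the collector visits "right" and observes the update
```

The interleaved write is visible to the second iteration, which is exactly the kind of interference that the proofs have to account for.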


Typing information. The semantics of RtIR is enriched with typing information. Basic types
in typ include TNum for numeric constants, TRef for references to regular objects (see below),
and TRefSet for non-null references to abstract mark-buffers. Local variables, global variables,
and field identifiers are declared to have exactly one of these types, respectively accessible
through the functions lvar_typ, gvar_typ and fid_typ. RtIR manipulates two kinds of values: numeric
values in the Coq type Z and references in ref. Types are mapped to values with the function
value, of type typ -> Type.

typ  = { TNum, TRef, TRefSet }
lvar = varId × typ
gvar = varId × typ
fid  = fieldId × typ

Definition value (t:typ) : Type :=
  match t with
  | TNum => Z
  | TRef | TRefSet => ref
  end.

Execution states. Local (resp. global) environments map local (resp. global) variables to values
of their declared type. Environments are hence dependent functions of type:

Definition lenv := forall x:lvar, value (lvar_typ x).
Definition genv := forall X:gvar, value (gvar_typ X).

A thread-local state is defined by a local environment and a command to execute. A global state
includes a global environment ge and a heap hp, a partial map from references to objects.
We consider two distinct kinds of objects: regular objects, mapping fields to values, and abstract
mark-buffers.

Definition thread_state := (cmd * lenv).

Record gstate := { ge: genv; freelist: ref -> bool;
                   hp: ref -> option object; roots: tid -> ref -> nat }.


Global  states  also  include  two  components  essential  to  the  implementation  of  a  GC:  roots  and  a 
freelist.  The  freelist  is  indeed  a  shared  data  structure,  while  roots  are  considered  to  be  thread- 
local  -  mutators  are  responsible  for  handling  their  own  roots  with  thread-local  counters.  Here,  we 
model  roots  as  part  of  the  global  state  only  to  ease  proof  annotations  -  our  final  theorem  is  an 
invariant  of  the  program  global  state. 

Finally,  execution  states  include  the  states  of  all  threads  and  a  global  state. 

Definition state := ((tid -> option thread_state) * gstate).


Well-typedness  invariants 

A number of invariants are guaranteed by typing: (i) each variable in the local or global
environment contains a value of the appropriate type, (ii) any reference of type TRef is either
null, in the domain of the heap, or in the freelist, and (iii) each abstract mark-buffer is
accessible from a unique global variable, indexed by a thread identifier. This mechanism enforces
separation of mark-buffers by typing.
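
In Coq, invariant (i) holds by construction because environments are dependent functions. In an untyped setting the same properties would have to be checked dynamically, as in the following Python sketch of invariants (i) and (ii). The Ref class, the use of None for null, and the function names are assumptions made for this illustration only.

```python
from dataclasses import dataclass

NULL = None  # assumption of this sketch: None models the null reference


@dataclass(frozen=True)
class Ref:
    addr: int


def has_type(value, typ):
    """Dynamic counterpart of the Coq function value : typ -> Type."""
    if typ == "TNum":
        return isinstance(value, int)
    if typ == "TRef":
        return value is NULL or isinstance(value, Ref)
    if typ == "TRefSet":   # references to mark-buffers are non-null
        return isinstance(value, Ref)
    raise ValueError(f"unknown type: {typ}")


def env_well_typed(env, decl):
    """Invariant (i): each variable holds a value of its declared type."""
    return all(has_type(env[x], decl[x]) for x in decl)


def refs_well_formed(env, decl, heap, freelist):
    """Invariant (ii): a TRef is null, in the heap domain, or in the freelist."""
    return all(
        env[x] is NULL or env[x] in heap or env[x] in freelist
        for x in decl
        if decl[x] == "TRef"
    )
```

For example, an environment that binds a TRef variable to a reference found neither in the heap nor in the freelist passes the first check but fails the second.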

4.3.2  RtIR  Proof  System 

On top of RtIR, we designed a program logic based on a variation of rely-guarantee, drawing on our
prior experience using this technique for atomicity refinement. When thinking about a particular
thread's code, we shall refer to the actions of the other concurrent threads as its context. This
context is formally encoded as a rely relation stating its possible execution steps. Thus, each
annotation in the code of a thread must be proved to be stable w.r.t. its rely condition, meaning
that its validity is not affected by possible state changes induced by any number of rely steps. We
follow a similar approach to encode guarantees. In fact, throughout our development we only need
to define guarantees, synthesizing the relies of other threads from guarantees.
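
The synthesis of relies from guarantees, and the stability check it feeds, can be sketched in a few lines of Python. The shared counter, the two thread names and their guarantees are hypothetical; the point is only that the rely of a thread is taken to be the union of the guarantees of all the other threads, and that an annotation preserved by one rely step is preserved by any number of them.

```python
STATES = range(5)  # hypothetical shared counter, small enough to enumerate

# Guarantees are binary relations over states (before, after).
GUARANTEES = {
    "collector": lambda s, t: t == s or t == s + 1,  # may increment
    "mutator": lambda s, t: t == s,                  # never writes the counter
}


def synthesize_rely(me):
    """Rely of thread `me`: any step allowed by another thread's guarantee."""
    others = [g for name, g in GUARANTEES.items() if name != me]
    return lambda s, t: any(g(s, t) for g in others)


def is_stable(assertion, rely):
    """Preserved by every single rely step, hence by any number of steps."""
    return all(
        assertion(t)
        for s in STATES if assertion(s)
        for t in STATES if rely(s, t)
    )


mutator_rely = synthesize_rely("mutator")      # what the mutator must tolerate
collector_rely = synthesize_rely("collector")  # what the collector must tolerate
```

Under these definitions, the annotation "counter >= 2" in the mutator's code is stable, whereas "counter == 2" is not, because the collector may increment the counter at any time; the same annotation is stable in the collector's code, since the mutator never writes the counter.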


High-level  design  choices  of  proof  rules 

In our approach, we first annotate a program, as is usually done on paper, and then prove the
annotated program using syntax-directed proof rules. We thus extend the syntax of commands to
include annotations. Syntax-directed proof rules were essential for proof automation.

The proof system decouples sequential and concurrent reasoning. Its first layer is a Hoare-like
system, with no use of relies or guarantees. A second layer handles interference: proof obligations
about relies, guarantees and stability checks of annotations.

Finally, to avoid polluting programs with routine annotations, typically the global invariants, the
first layer of the system assumes that such invariants hold, and the second layer requires their
invariance to be proved separately as a stability check.


5  CONCLUSIONS 

The three most significant contributions of this project validate the thesis underlying the
proposed effort: (1) a compiler infrastructure, aware of concurrently executing runtime-managed
services, that is amenable to formal verification and mechanized proofs; (2) a formalization of the
Java cookbook that proves the soundness of compilation schemes from Java source programs to weak
memory architectures like Power; and (3) a fully verified implementation of a concurrent garbage
collector built using concepts derived from (1) and (2). All proofs have been verified in the Coq
proof assistant and are publicly available.
7 List of Symbols, Abbreviations, and Acronyms


CAS - Compare and Set
DRF - Data-Race Freedom
GC - Garbage Collection
IRIW - Independent Reads of Independent Writes
IR - Intermediate Representation
JMM - Java Memory Model
JVM - Java Virtual Machine
MIR - Managed Intermediate Representation
PSO - Partial Store Ordering
RG - Rely-Guarantee
RMO - Relaxed Memory Ordering
SC - Sequentially Consistent
TSO - Total Store Ordering
WRC - Write-Read Conflict